-
-
-
-
-
-
-
-
-
- Keeping watch over the flocks
- at night (and day)
-
-
- Kenneth Ingham
- University of New Mexico Computing Center
- Distributed Systems Group
- 2701 Campus NE
- Albuquerque, NM 87131
- (505) 277-8044
- ingham@charon.unm.edu
- ucbvax!unmvax!charon!ingham
-
- Topic Areas: Applications, System management, Utilities
-
-
-
- The computing facilities offered by the University of New
- Mexico Computing Center include three microvaxen, five large
- vaxen (780 or bigger), and a Sequent B8000. In addition to
- these Unix/VMS machines, the UNMCC Distributed Systems Group
- (DSG) monitors a number of the various microvaxen and Sun
- workstations scattered across campus. This duty falls to
- the DSG Programmer designated as "DOC", or "DSG On Call",
- who receives his beeper based on a monthly rotation
- schedule.
-
- In the past, shell scripts running every six hours reported
- various system statistics to DOC, who then scanned the
- output for signs of possible trouble. As the number of
- machines and the number of potential problems grew, the
- mound of output that DOC had to process, most of which
- merely indicated normal system operation, became
- overwhelming. Now, with several machines to monitor and only
- one person acting in this capacity, DOC can often waste a
- tremendous amount of time wading through system status
- reports, time which can be better spent actually fixing
- system problems.
-
- In response to this situation, the author developed a tool
- which introduces some intelligence into the machine's
- self-reporting, letting the machine filter out messages
- indicating normal operation and forwarding to DOC only those
- messages which point out trouble areas. The result of these
- efforts is Watcher, a very general and extensible system
- self-monitor. Running more often than the set of shell
- scripts, Watcher keeps closer tabs on the system; since it
- delivers only a summary of potential problems, however, this
- extra monitoring produces no corresponding increase in the
- demand on the system manager. No problems slip by unnoticed
- in the more concise output, leading to an improvement in
- overall system availability as well as the more effective
- utilization of the system manager's time.
-
- Watcher was designed to be almost as flexible as DOC in
- deciding what constitutes a problem with the system. Running
- at intervals specified in crontab, Watcher issues a number
- of user-specified commands (each of which delivers its
- output in a different format), parsing all or part of the
- output from either the left or the right. It compares this
- to the last such output obtained, checking for indications
- of a system abnormality. Such signs might take the form of a
- too abrupt change in a certain value (e.g. a process which
- suddenly begins gobbling vast amounts of CPU time), a value
- which falls outside the allowable maximum or minimum (such
- as an overly full file system), or an unacceptable change in
- a string value (e.g. when "up" changes to "down"). For
- commands such as "ps" whose output varies considerably with
- each run, specific parts of the output can be designated as
- a key; successive runs of Watcher will home in on these key
- areas for their comparisons.
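-
- The comparison described above can be illustrated with a
- short Python sketch. It is not Watcher's actual code or
- configuration format: the Rule class, its field names, and
- the parse and check functions are all hypothetical, and the
- sample "df"-style text is made up for the illustration.
- Negative column numbers count fields from the right, and the
- key column matches rows between successive runs.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Rule:
    """Hypothetical description of one watched field (not Watcher's format)."""
    column: int                          # field holding the value; negative counts from the right
    key_column: int                      # field that identifies a row across runs
    max_value: Optional[float] = None    # report values above this
    min_value: Optional[float] = None    # report values below this
    max_change: Optional[float] = None   # report larger absolute changes between runs
    numeric: bool = True                 # False: report any change in the string value


def parse(output: str, rule: Rule) -> Dict[str, str]:
    """Map each row's key field to its watched field, skipping short lines."""
    rows = {}
    for line in output.splitlines():
        fields = line.split()
        try:
            rows[fields[rule.key_column]] = fields[rule.column]
        except IndexError:
            continue  # line does not have enough fields to be of interest
    return rows


def check(previous: str, current: str, rule: Rule) -> List[str]:
    """Return one message per anomaly; an empty list means normal operation."""
    old = parse(previous, rule)
    new = parse(current, rule)
    problems = []
    for key, value in new.items():
        if not rule.numeric:
            if key in old and old[key] != value:
                problems.append(f"{key}: changed from {old[key]!r} to {value!r}")
            continue
        number = float(value.rstrip("%"))
        if rule.max_value is not None and number > rule.max_value:
            problems.append(f"{key}: {value} is above the allowed maximum")
        if rule.min_value is not None and number < rule.min_value:
            problems.append(f"{key}: {value} is below the allowed minimum")
        if rule.max_change is not None and key in old:
            if abs(number - float(old[key].rstrip("%"))) > rule.max_change:
                problems.append(f"{key}: changed abruptly from {old[key]} to {value}")
    return problems


if __name__ == "__main__":
    # Made-up output of two successive runs of a df-like command.
    last_run = "/dev/ra0a 7415 5230 71% /\n/dev/ra0g 38383 30000 78% /usr"
    this_run = "/dev/ra0a 7415 5300 72% /\n/dev/ra0g 38383 37000 96% /usr"
    # Watch the second field from the right, keyed on the mount point.
    rule = Rule(column=-2, key_column=-1, max_value=90, max_change=10)
    for message in check(last_run, this_run, rule):
        print(message)
```

- In this sketch only rows that break a rule produce a
- message, so a run in which nothing is wrong yields no
- output at all, which is the filtering behavior the text
- describes.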
-
- Since the user specifies not only the commands Watcher will
- execute and the time lapse between successive runs, but also
- the aforementioned parameters which indicate system
- anomalies, Watcher can easily be seen as a very flexible,
- general system monitor. Its use at UNM has provided a
- marked increase in the productivity of the system manager,
- which has led in turn to an increase in the reliability and
- availability of the systems at UNMCC.